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Abstract. Many structured data-fitting applications require the solution of an optimization 
problem involving a sum over a potentially large number of measurements. Incremental gradient 
algorithms offer inexpensive iterations by sampling a subset of the terms in the sum; these methods 
can make great progress initially, but often slow as they approach a solution. In contrast, full-gradient 
methods achieve steady convergence at the expense of evaluating the full objective and gradient on 
each iteration. We explore hybrid methods that exhibit the benefits of both approaches. Rate-of- 
convergence analysis shows that by controlling the sample size in an incremental-gradient algorithm, 
it is possible to maintain the steady convergence rates of full-gradient methods. We detail a practical 
quasi-Newton implementation based on this approach. Numerical experiments illustrate its potential 
benefits. 

1. Introduction. Data-fitting applications are often typified by optimization 
problems of the form 



where each function fi corresponds to a single observation or measurement, and models 
the misfit for a given choice of parameters x. The aim is to choose parameters that 
minimize the misfit (or loss) across all measurements. The canonical example is least 
squares, and in that case, 



This misfit model corresponds to a linear model with Gaussian errors on the measure- 
ments b. For applications where the measurements b are binary, a more appropriate 
model is logistic regression, described by the choice 



These are both special cases of the more general maximum-likelihood problem. The 
maximum- likelihood approach gives rise to separable problems like (1.1) whenever the 
measurements {ai,bi) are assumed to be independent and identically distributed. 

If the number of measurements M is very large, or if the individual fi are compli- 
cated functions (e.g., each fi evaluation may require the solution of a partial differential 
equation), then evaluating f{x) and V/(a;) can be computationally expensive. How- 
ever, there is often a large amount of uniformity in the measurements, which means 
that a full evaluation of /(x) and V f{x) may be unnecessary to make progress in 
solving (1.1). This motivates incremental- gradient methods, in which each iteration 
only evaluates the gradient with respect to a single fi [3, §3.2]. 
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minimize 




(1.1) 



fi{x)^\{ajx-b,f for all i = l,...,Af. 



fi{x) ^\og{l + eyiY>[~b^aJx\) for all i = l,...,M. 
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Incremental-gradient methods enjoy an iteration cost that is M times faster than 
full-gradient methods because the iterations are independent of M . Thus, in the 
time it takes to make one full-gradient iteration, the incremental-gradient method 
can achieve M iterations, which often results in rapid initial progress. However, the 
number of iterations needed to reach the same level of accuracy may be much higher. 
Indeed, because of their faster convergence rate, full-gradient methods must eventually 
dominate incremental-gradient methods. 

Our aim is to develop a method that exhibits the benefits of these two extremes. 
The approach is based on starting with iterations that resemble an incremental-gradient 
approach and use relatively few measurements to approximate the gradient; as the 
iterations proceed, the algorithm gradually increases the number of measurements. 
This preserves the rapid initial progress of incremental-gradient methods without 
sacrificing the convergence rate of full-gradient methods. 

1.1. Gradient descent with error. In the most basic version of the algorithm 
that we consider, each iterate is computed via the update 

Xk+i = Xk - akgk (1-2) 

for some step size a^; the search direction 

gk ■= ^f{xk) + efc (1.3) 

is an approximation of the gradient, and Ck is the residual in its computation. This 
framework is the point of departure in the analysis of §§2-3. Evidently, the full gradient 
method (i.e., steepest descent) corresponds to the case where = 0, which means 
that the gradient is exactly computed at each iteration. 

The stochastic approximation method deals with the case where the residual 
is a random variable; see [3]. In this context we typically assume that E[efc] = 
and E[|jefe|p] < B for some constant which are sufficient to guarantee that the 
iterates converge in a probabilistic sense (for a suitable choice of decreasing step sizes) . 
Incremental-gradient methods are a special case of stochastic approximation. Here, 
rather than computing the full gradient ^ f{xk) on each iteration, a function fi is 
randomly selected among z S {1, . . . , M}, and the gradient estimate is constructed as 
gk = ^fi[xk). 

In this work we consider scenarios in which the gradient residual can be controlled 
on each iteration, by specifying a per- iteration bound Bk on the norm of the residual. In 
particular, we characterize convergence of the gradient-with-error algorithm (1.2)-(1.3) 
under two different conditions: 

deterministic: ||e/c|p < Bk] or stochastic: E[|lefc|p] < Bk- (1.4) 

Our analysis applies whether the noise is deterministic or stochastic, and we do not 
assume that the noise has zero mean. In the context of problem (1.1), the error in the 
gradient is a result of using a sample of the fi functions (sometimes referred to as a 
hatch). In §3 we show how the sample size (or hatch size) infiuences the bound Bk- 

We also consider in §4 the case in which the approximate gradient gk is scaled to 
account for curvature information in /. In this case, the iteration update is 

Xk+i = Xk~ akdk, (1.5) 

where dk solves the system 

Hkd = -gk, (1.6) 
and Hk is a positive-definite approximation (e.g., a quasi-Newton Hessian) to V^/. 
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1.2. Assumptions and notation. We make the blanket assumption throughout 
that a minimizer of / always exists, that the functions fi : M" — > M are continuously 
differentiable, and that the overall gradient of / is uniformly Lipschitz continuous, i.e., 
for some positive L, 

\\^f{x)^Wf{y)\\<L\\x-y\\ for ah x,yeR". (1.7a) 

We also assume that / is strongly convex with (positive) parameter /i: 

f{y)>f{x) + (y-xfWf{x)+^^fi\\y-xf for ah x,yeM". (1.7b) 

If / is twice-continuously differentiable, then these assumptions are equivalent to the 
condition that the eigenvalues of the Hessian are uniformly bounded above and below: 

^ y'^fix) ^ LI. 

The ratio L/fj, > 1 is known as the condition number of / [26, §2.1.3]. 

We make the assumption — standard in the stochastic optimization literature [3, 
§4.2]— that 

||V/,(a;)f </3i+/?2||V/(x)||2 for all x and i = 1, . . . , M, (1.8) 

for some constants /3i > and 1^2 > 1. This implies that ||efc|| is bounded as a function 
of the true gradient of the objective. 

We describe two versions of linear convergence of function values. The first is 
denoted as weak linear convergence, and is characterized by a bounding sequence on 
the function value at every iteration k: 

f{xk) ~ f{x^) =0{(J^) for some ct < 1. (1.9) 

This is a non-asymptotic version of R-linear convergence. (See, for example, [28, §A.2].) 
The second is denoted as strong linear convergence, and characterizes the decrease of 
the function value at every iteration k: 

f{xk+i)-f{x^)<a[f{xk)-f{x^)] for some a<l. (1.10) 

We emphasize that this inequality applies to all iterations of the algorithm, and is a 
non- asymptotic version of Q-linear convergence. Note that (1.10) implies (1.9), though 
the converse is not implied. 

All of the convergence results that we analyze are described only in terms of 
convergence of the objective function values, and not the iterates Xk- This is sufficient, 
however, because the strong convexity assumption allows us to directly deduce a 
convergence rate of Xk to via the corresponding function values. In particular, 
strong convexity of / implies 

^\\xk-x,f < f{xk)- fix,). (1.11) 

Thus, the rate at which the squared error ||a;fe — converges is at least as fast as 
the rate at which f{xk) converges to the optimal value /(a;*). 

If the matrix Hk in the iterations (1.5)-(1.6) is uniformly positive definite and 
uniformly bounded in norm (as can be enforced in practice), then assumptions (1.7a)- 
(1.7b) can be replaced by the following conditions: there exist positive constants L' 
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and jj! such that for aU x,y E M" and for aU fc = 0, 1, . . ., 

l|V/(a;) - V/(y)|| < L'\\x - yWn, (1.12a) 

k 

f{y) > fix) + {y- xfVfix) + ^fi'Wx - y\\l^, (1.12b) 

where the quadratic norm ||a;|j^fj. = \/ x'^HkX and its dual |ja;||j:^-i are used instead of 
the EucUdean norm. It can then be verified that all of the results in §§2-3 apply to 
the Newton-like algorithm (1.5)-(1.6), where the parameters L and /i in those results 
are replaced by the parameters L' and /i' found in (1.12). The benefit of this approach 
is that a judicious choice of the scaling can lead to a scaled condition number 
L' I p! that can be smaller than the condition number L//i of the unsealed objective /, 
effectively improving on the error constants found in the convergence results. 

1.3. Contributions. This paper is divided into six components. 

Weak linear convergence with generic bounds (§2.1). We analyze the convergence 
rate under a generic sequence {Bk}- Our results imply that for any (sub)linearly 
decreasing sequence {Bk}, the algorithm lias a weak (sub)linear convergence rate. In 
the expected-error version of (1.4), the convergence rate is described in terms of the 
expected function value. 

Strong linear convergence with particular bounds (§2.2). We describe a particular 
construction of the sequence {-Bfc} that ensures that the algorithm has a strong linear 
convergence rate. The rate achieved under this sequence can be arbitrarily close to 
the rate of the standard gradient method without error, without requiring an exact 
gradient calculation on any iteration. 

Sublinear convergence without strong convexity (§2.3). Without the strong convex- 
ity assumption on /, the convergence rate for the deterministic gradient method is 
sublinear. We show that a summable sequence {Bk} is sufficient to maintain tlie same 
sublinear rate. 

Application to sample-average gradients (§3). For data-fitting problems of the 
form (1.1), we show that a growing sample-size strategy can be used as a mechanism 
for controlling the error in the estimated gradient and achieving a linear rate. In effect, 
choosing the sample size allows us to control the error, as in (1.4). By growing the 
sample size sufficiently fast, we implicitly set the rate at which Bk — t- 0, and hence the 
overall rate of the algorithm. 

A practical quasi-Newton implementation (§4). We describe a practical implemen- 
tation of the ideas based on a limited-memory quasi-Newton approximation and a 
heuristic line search. 

Numerical results (§5). We evaluate the implementation on a variety of data- 
fitting applications (comparing to incremental gradient methods and a deterministic 
quasi-Newton method). 

1.4. Related work. Our approach is based on bridging the gap between two ends 
of a spectrum, where incremental gradient methods are at the end of "cheap iterations 
with slow convergence" , and full gradient methods are "expensive iterations with 
fast convergence" . In particular, incremental-gradient methods achieve an expected 
sublinear convergence rate on the expected value of f{xk), i.e., 

E[f{xk)-f{x,)]=0{l/k), 

where the iterations are described by (1.2) for — 0{l/k) [25, §2.1]. In fact, among 
all first-order methods, this is the best possible dependency on k given only a first-order 
stochastic oracle; thus a linear rate is not possible [24, §14.1]. 
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In contrast, consider the basic gradient-descent iteration (1.2) with a fixed step size 
ttfc = 1/i and = V/(a;fe). It is well known that this algorithm has a strong linear 
convergence rate, and satisfies the per-iteration decrease in (1.10) with cr = 1 — /i/L; 
see [17, §8.6]. 

Some authors have analyzed incremental gradient methods with a constant step 
size [23]. Although this strategy does not converge to the optimal solution, it does 
converge at a linear rate to a neighborhood of the solution (where the size of the 
neighborhood increases with the step size). 

In the context of incremental gradient methods, several other hybrid methods 
have been proposed that achieve a linear convergence rate. The works of Bertsekas [2] 
and Blatt, Hero, and Gauchman [6] are the closest in spirit to our proposed approach. 
However, the convergence rates for these methods treat full passes through the data as 
iterations, similar to the full gradient method. Further, there are numerical difficulties 
in evaluating certain sequences associated with the method of Bertsekas, while the 
method of Blatt et al. may require an excessive amount of memory. 

This is not the first work to examine a growing sample-size strategy. This type of 
strategy appears to be a "folk" algorithm used by practitioners in several application 
domains, and it is explicitly mentioned in an informal context by some authors such 
as Bertsekas and Tsitsiklis [3, page 113], who refer to growing sample-size techniques 
as "batching" strategies. This is the first work, that we are aware of, that presents a 
theoretical analysis of the technique and that proposes a practical large-scale quasi- 
Newton implementation along with an experimental evaluation. 

Gradient descent with a decreasing sequence of errors in the gradient measurement 
was previously analyzed by Luo and Tseng [18], and they present analogous weak and 
strong linear convergence results depending on the sequence of bounds on the noise. 
Our analysis extends this previous work in several ways: 

• The weak linear convergence rate described in [18] requires a per-iteration 
strict decrease in the objective. In contrast, our analysis in §2.1 does not 
require this assumption, and allows for the (realistic) possibility that the noisy 
gradient can lead to an increase in the objective function on some iterations. 

• The strong linear convergence rate shown in [18] only holds asymptotically, 
while the construction we give (see §2.2) leads to a non-asymptotic rate of the 
form (1.10) that applies to all iterations of the algorithm. 

• Luo and Tseng [18] consider deterministic errors in the gradient measurement 
that can be bounded in an absolute sense; we also consider the more general 
scenario where the error is stochastic and can only be bounded in expectation. 

1.5. Reproducible research. Following the discipline of reproducible research, 
the source code and data files required to reproduce the experimental results of this 
paper can be downloaded from 

http : //www. cs .ubc . ca/labs/ scl/FriedlcaiderSchmidt2011. 

2. Convergence analysis. Our convergence analysis first considers a basic first- 
order method with the constant step size ak — 1/L. The following intermediate result 
establishes an upper bound on the objective value at each iteration in terms of the 
residual in the computed gradient. 



Lemma 2.1. At each iteration k of algorithm (1.2), with ak = ^/L. 

f{xk+i) - fix.) < (1 - ^i/L)[fixk) - fix.)] + :^||efe|p. (2.1) 
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(2.3) 



Proof. It follows from assumptions (1.7) that the following inequalities hold: 

f{y) < fix) + {y~ xfWfix) + ^\\y~ x\\', (2.2a) 
f{y) > fix) + iy - x)^V/(x) + - . (2.2b) 

Use X = Xk and y = Xk — i\/L)gk in (2.2a) and simplify to obtain 

fixk - il/L)9{xk)) < fixk) - \gixk)'^Vfixk) + ^\\gixk)f. 
Next, use the definitions of Xk+i and gk (cf. (1.2)-(1.3)) in this expression to obtain 

fixk+i) < f{xk) - ^("^fixk) + ek)^Vfixk) + ^\\Vf{xk) + ekf 
= f{xk) - ^\\Vfixk)f ~ ^Vf{xk)^ek 

+ ^l|V/(x,)|p + ^Vfixk)^ek + ^Wckf 
-/(^fc)-^l|V/(xfc)f + ^||e,|p. 

We now use (2.2b) to derive a lower bound on the norm of V fixk) in terms of 
the optimality of f{xk)- Do this by minimizing both sides of (2.2b) with respect to y: 
by definition, the minimum of the left-hand side is achieved hy y = x^; the minimizer 
of the right-hand side is given by y a; — (l//i)V/(a;). Thus, 

fix.) > fix) -V/(x)^V/(x) + ^V/(a;)^V/(x) = f{x) ;^||V/(x)f 

for any x. Re-arranging and specializing to the case where x = Xk, 

||V/(a;fe)|p > 2^l[fixk)-fix,)]. (2.4) 
Subtract fix.) from both sides of (2.3) and use (2.4) to get 

fixk+i) - fix.) < fixk) - fix.) - ^[f{xk) ~ fix,)] + ^llefclP 
= {l-^,/L)[f{xk)~fix.)] + ^\\ek\\\ 

which gives the required result. □ 

As an aside, note that (2.3) shows that the objective decreases monotonically, i.e., 
fixk+i) < fixk) if llefcll < ||V/(xfc)|l. In general, however, we do not require this 
condition. 

2.1. Weak linear convergence. In this section we show that if {Bk} is any 
(sub)linearly convergent sequence, then algorithm (1.2) has a (sub)linear convergence 
rate. This result reflects that the convergence rate of the approximate gradient 
algorithm is not better than the rate at which the noise goes to zero, and of course is 
also not better than the rate of the noiseless algorithm. 
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Theorem 2.2 (Weak convergence rate under absolute error bounds). Suppose that 
||eft|P < Bk, where 

lim Bk+i/Bk < 1. (2.5) 

Then at each iteration of algorithm (1.2) with ak ^ 1/^; for any e > we have 

f{xk) - fix.) < (1 - f^/Lf[f{xo) - fix,)] + OiCk), 
where Ck — max{i3fe, (1 — /i/i + e)'^}. 

Proof. Let p := 1 ~ n/L. Because ||efc|p < Bk, Lemma 2.1 implies 

fixk+i) - fix,) < [fixk) - fix*)]p + j^Bk. 

Applying this recursively, 

fixk) - fix,) = ifixo) - fix,)]p'^ + Oiiik), 

where ^fc := Y^lZl p''^''~^Bi. Observe that fik+i = Bk + p^k- 

We now show that p.k < ^Cfc for all k for some ^ > 0. It follows from (2.5) and 
the definition of Ck that limfc_j.oo Ck+i/Ck > p + and thus that there exists some N 
such that Ck+i/Ck — p > e/2 for all k > N. We now choose ^ such that 

Pk < CCfe for all k<N, and £,e/2>l. 

This is always possible because the fj.k are finite and e > 0. We now show by induction 
that pk < £,Ck for all k. This trivially holds for all fc < by the definition of ^. 
Assuming this holds for some arbitrary k > N, we have 

Pk+i = Bk+ ppk < Ck + p£,Ck 

< a^/2)Ck + p^Ck < ack+i/Ck - p)Ck + p^Ck = 

Thus, Pk = OiCk) for all fc, as required. □ 

An implication of this result is that the algorithm has a linear convergence rate if 
Bk converges linearly to zero. For example, if Bk — 0(7*') with 7 < 1, then 

fixk)- fix,)^Oia''), 

where a = max{7, (1 — ^/L + e)}) for any positive e < p/ L. 

Theorem 2.2 also yields a convergence rate in scenarios where the high cost of 
computing an accurate gradient might make it appealing to allow the error in the 
gradient measurement to decrease sublinearly. For example, if = C(l/A:^), then 
fixk) — fix*) — ©(l/fc^), which is the rate achieved by Nesterov's optimal method for 
general smooth convex (but not necessarily strongly convex) functions in the noiseless 
setting [26, §2.1]. Theorem 2.2 also allows for the possibility that the bound Bk does 
not converge to zero (necessarily implying the limit of Bk+i/Bk is one). In this case, 
the result simply states that the distance to optimality is eventually bounded by a 
constant times Bk- 

The above analysis allows for the possibility that the approximate gradient is 
computed by a stochastic algorithm where the error made by the algorithm can be 
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bounded in an absolute sense. We now consider the more general case where the 
error can only be bounded in expectation. The following result is the counterpart to 
Theorem 2.2, where we instead have a bound on the expected value of ||eA;|p. 



Theorem 2.3 (Weak expected convergence rate under expected error bounds). 

Suppose that E[|lefe|p] < B^, where 

lim Bk+i/Bk < 1. 

Then at each iteration of algorithm (1.2) with ak = ^/L, for any e > we have 

E[f{xk) - fix,)] < (1 - mA)'=[/(xo) - fix,)] + OiCk), 
where Ck — maxjSfe, (1 — /i/L + e)*^}. 



Proof. We use Lemma 2.1 and take expectations of (2.1) to obtain 

E[/(xfe+i) - fix,)] < (1 - ^l/L)E[fixk) - fix,)] + ^E[||e,||2]. 

Proceeding as in the proof of Theorem 2.2, we obtain a similar result, but based on 
the expected value of the objective. □ 

2.2. Strong linear convergence. We now describe a particular construction 
of the sequence {Bk} that allows us to achieve a linear decrease of the function values 
that applies to every iteration k. This strong guarantee, however, comes at the price 
of requiring bounds on three quantities that are unknowable for general problems: the 
strong convexity constant /x, the gradient's Lipschitz constant L, and a non-trivial 
lower bound on the current iterate's distance to optimality fixk) ~ fix,). 

In particular, we consider any sequence {Bk} that satisfies 

0<Bk< 2Liii/L - p)nk, fc - 0, 1, . . . , (2.6) 

where p < fi/L is a positive constant, which controls the convergence rate, and tt^ is a 
non-negative lower bound on the distance to optimality, i.e., 

0<7rk<fixk)~fix,). (2.7) 



Theorem 2.4 (Strong linear rate under absolute error bounds). Suppose that 
||efe|p < Bk where Bk is given by (2.6). Then at each iteration of algorithm (1.2) 
with ak = ^/L, 

fixk+i) - fix,) < (1 - p)[fixk) - fix,)]. 
Proof. Because ||efc|p < Bk, Lemma 2.1 and (2.7) imply that 

fixk+i) - fix,) < (1 - ^l/L)[fixk) - fix,)] + ^Bk 

< (1 - ^l/L)[fixk) - fix,)] + iii/L - p)^k 

< (1 - ^i/L)[fixk) - fix,)] + ip/L - p)[fixk) - fix,)] 
= il-p)[fixk)-fix,)i 
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as required. □ 

As we did with Theorem 2.3, we consider the case where the approximate gradient is 
computed by a stochastic algorithm, and the error can only be bounded in expectation. 
Provided we now have a lower bound TTfe on the expected sub-optimality, i.e., 

0<7rfc <E[/(xfe)-/(x,)], 

it is possible to show an expected linear convergence rate that parallels Theorem 2.4. 
The proof follows that of Theorem 2.4, where we instead begin by taking the expectation 
of both sides of (2.1). 



Theorem 2.5 (Strong expected linear rate under expected error bounds). Sup- 
pose that E[||efcp] < Bf^ where Bj^ is given by (2.6). Then at each iteration of 
algorithm (1.2) with au = ^/L, 

E[/(xfc+i) - fix,)] < (1 - p)E[/(xfc) - fix,)]. 

In both scenarios, we obtain the fastest convergence rate in the extreme case where 
p = fJ-/L. From (2.6), this means that Bk must be zero, i.e., the gradient is exact, and 
we obtain the classic strong- linear convergence result with error constant a = 1 — /i/L 
stated in §1.4. However, if we take any positive p less than p/L, then we obtain a 
slower linear convergence rate but (2.6) allows Bj, to be non-zero (as long as tt^ > 0). 

The bound (2.6) on the error depends on both the conditioning of the problem 
(as determined by the parameter p and L), and on the lower bound on the distance 
to the optimal function value tt^. This seems intuitive. For example, we can allow 
a larger error in the gradient calculation the further the current iterate is from the 
optimal function value, but a more accurate calculation is needed to maintain the 
strong linear convergence rate as the iterates approach the solution. Similarly, if the 
problem is well conditioned so that the ratio p/L \s close to 1, a larger error in the 
gradient calculation is permitted, but for ill-conditioned problems where p/L is very 
small, we require a more accurate gradient calculation. 

Note that the analysis of this section holds even if the bounds /x, L, and tt^ are not 
the tightest possible. Unsurprisingly, we obtain the fastest convergence rate when p is 
as large and L is as small as possible, while the largest error in the gradient calculation 
is allowed if TTfc is similarly as large as possible. Note that with tt^ = 0, which is a 
trivial bound that is valid by definition, we require an exact gradient. 

Although it is difficult in general to obtain a bound Trj. that satisfies (2.7), there 
are at least two possible heuristics that could be used in practice to approximate 
this quantity. The first heuristic is possible if we have bounds on the Lipschitz and 
strong-convexity constants (L and //), as well as the norm of the gradient at Xk- In 
that case, we use the stationarity of a;*, (1.7a), and (1.11) to deduce that 

^||V/(:Efc)|P < ^\\xk - x,r < fixk) - fix,). 

Although we typically do not have access to ||V/(xfe)||, the norm of its approximation 
\\gk\\ is a reasonable proxy, which will improve as the magnitude of the gradient residual 
decreases. 

The second heuristic is based on the assumption that the distance of consecutive 
iterates to the solution x, decreases monotonically. In that case, 

\\xk - Xfe+ilP = Wixk - X,) - ixk+1 - x,)\\^ < 4\\xk - x,\\^. 
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Coupling this with (1.11) and premuhiplying by /i/8, we obtain the bound 

^\\Xk - Xk+lW^ < f{xk) - f{x*). 

This option may give a reasonable heuristic in the context of the increasing sample-size 
strategy even if the distance to the optimal solution does increase on some iterations, 
because as the sample size increases it becomes less likely that the distance will 
increase. 

2.3. RelELxing strong convexity. If we remove the assumption of strong convex- 
ity, the deterministic gradient method has a sublinear convergence rate of 0(l/fc) [26, 
§2.1.5] while the stochastic gradient method under standard assumptions has a slower 
sublinear convergence rate of 0{1/Vk) [24, §14.1]. Using an argument that does not 
rely on strong convexity, we show that the 0{l/k) convergence rate of the deterministic 
gradient method is preserved for the average of the iterates if the residuals of the 
computed gradients are summable. 



Theorem 2.6 (Sublinear rate under summable error bounds). Suppose that 
Si^o W^kW < oo. Then at each iteration of algorithm (1.2) with ak = ^/L, 



f{xk)- f{x,)=Oil/k), 



where Xk ■= (1/fc) Y.i=i 



Proof. As in (2.3), Lipschitz continuity of the gradient implies that 

f{xk+i) < f{xk) - ^WVfixkW + ^hkf. 

We use convexity of / to bound f{xk) and obtain 

f{xk+i) < fix,) + {xk - x,)^\/f{xk) - ^||V/(xfc)||2 + ^llefelp. 

Combine (1.2) and (1.3) to deduce that —Vf(xk) = Ck + L{xk+i — Xk), and move 
f{x,) to the left-hand side to get, after simplifying, 

f{xk+i) - f{x,) = ~{xk - a;*)^(efe + L[xk+i - Xk]) 

- ^\\xk+i - XkW'^ - [xk+i - Xk)'^ek 

= ^\\xk - a;*|p - (xfc - x,)'^ek - {xk+i - Xk)'^ek 

- ^\\xk - - L{xk - x,)'^{xk+i - Xk) - ^\\xk+i - XkW"^ 
< ^ [\\xk - - \\xk+i - + ||efc|| • \\xk+i - x,\\. 

Summing both sides up to iteration k we get 

J2 t-^^^') - ■^(^*)] f 11^0 - x,\f - ^\\xk-x,f + Y, ■ lla;. -x4. (2.8) 
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We will first use (2.8) to show that the sequence {\\xh — is bounded, and then 
use it to obtain the final result. 

Because the left-hand side of (2.8) is nonnegative, 



El 



^i— ill ' ||*^z 



We now prove that the sequence {||a:^fe — a;* ||} is bounded by showing that the auxiliary 
sequence {dk}, with dk max{l, \\xk — x■^,\\}, is bounded. Because dk > 1, 



dk < dl < max < 1, ||a:o — 



ill • \\Xi - X* 



i=l 
k 

< dl + ClY,\\e^-l\\ -dz, 



where Ci :— 2/L. Because the sequence {||efc||} is summable, it holds that ||efc|| — > 0, 
and thus that there exists some N large enough that Ci||efc_i|| < 1 for all fc > A^. 
Partition the sum at N: 



4 < Co + Ci ^ ||ei_i|| • d, for ah k> N, 



i=N 



where Co '■— (i§ + Ci Y^'^=i ll^i-ill ' di- Rearrange terms to get 



fc-i 



(l-Ci||efc_i||)dfc<Co + Ci^||e,_i||-d, for aU k>N, 



i=N 



and because Ci||efc_i|| < 1 for all k > N, 



dk < 



1 



k-l 



Co+C,J2\\^^-l\\d^ 



i=N 



1-Ci||efc_i 

If we apply this bound recursively to c?/c-i we obtain 



dk < 



1 



1-Ci||efc_ 



for all k > N. 



k~2 



ek-2\ 



Co + Ci^ ||e,_i||d. 



1 



(1-Ci||efe_i||)(l-Ci||efe_2||) 



fc-2 



Co + Ci^ ||e,_i||d. 



=N 



for all k> N, 



and if we apply it recursively from k down to N we obtain 

2Co . 2Co 



dk < — ^ 



< 



nt^(l-Ci||e,_i||)-nSA.(l-Ci||e,_i||)- 



(2.9) 
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To see that the right-hand side term is bounded, take logarithms of both sides to get 

CO 

logdfc < log(2Co) - log(l - 

We now use the hmit-comparison test, which asserts that for nonnegative sequences 
{ai} and {6^}, if < limi^oo ai/h < oo, then < °^ imphes J^i'^i < We 

thus compare the sequence {— log(l — Ci||ei_i||)} to the summable sequence {||ei_i||} 
using L'Hopital's rule to get 

^.^ -log(l -C,He._iH) ^ T^cfa ^ 

i^oo ll^'i — ill i— ^oo 1 

Thus, the sequence {— log(l — Ci||ei_i||)} is summable, which implies, via (2.9) and 
the definition of d^, that the sequence {||a;fc — x*||} is bounded. 

We now use convexity of / and (2.8) to bound the function value of the average 
iterate: 

k 

-/(x*)< ^5][/(x.)-/(x.)] 

i=l 

L 1 , ^ ^ 

i=l 

Because the sequence {||efc||} is summable (by assumption), and the sequence {||2:fc — 
a;*||} is bounded, this last inequality implies the conclusion of the theorem. □ 

Note that the convergence rate also holds for the iterate that achieves the lowest 
function value but, unlike the deterministic case, this is not guaranteed to be the last 
iteration. 

3. Application to sample-average gradients. Incremental-gradient methods 
for (1.1) are based on the iteration scheme (1.2) with the gradient approximation 

9k ■■= "^M^k), 

where the set Bk C {1, . . . , M} represents a sample of the measurements that constitute 
the full data set. Typically, Bk contains a single element that is chosen in either a 
cyclic fashion or sampled at random, and as discussed in §1.4, the convergence rate 
of the method is sublinear. As the sample size increases, however, the error in the 
sampled gradient decreases, and so the sample size can be used to implicitly control 
the error in the gradient. We use the results of §2 to develop an increasing sample-size 
strategy that improves on this sublinear rate. 

Let Mk denote the complement of Bk^ so that Bk U Nk — {1, • ■ • , ^i}- Then the 
gradient residual, defined by (1.3), satisfies 

The first term is a re-weighting of the gradient approximation and the second term is 
the portion of Vf{xk) that is not sampled. Assumption (1.8) allows us to bound the 
norm of this residual in terms of the full gradient, and thus leverage the results of §2. 
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3.1. Deterministic sampling. It follows from assumption (1.8) that the gradi- 
ent residual in (3.1) satisifies 



< 



^)Ev/.(..)-i,i:v/.M 



< 



< 4 



M\Bk 

M-\Bk\ 
M\Bk\ 

M^\Bk\ 
M\Bk\ 

M~\Bk\ 
M 



H 2 



2 



(/3l+/32||V/(Xfe)||2), 



where the triangle-inequality is applied repeatedly, and the resulting terms simplified. 
Next, in the same way that (2.4) was derived, we use (2.2a) to derive the upper bound 

llV/(a:fe)|p <2L[/(a:fe)-/(a;,)]. 

Thus the bound on ||efe||^ can be expressed in terms of the sample-size ratio and the 
distance to optimality: 



M-\Bk\ 



M 



(/3i+2/32£[/(a;fe)-/(a;,)]). 



(3.2) 



The following result parallels Theorem 2.2, and asserts that a linearly increasing 
sample size is sufficient to induce a weak linear convergence rate of the algorithm. 



Theorem 3.1 (Weak linear rate with deterministic sampling). Suppose that (1. 
holds, and that the sample size \Bk\ increases geometrically towards M, i.e., 



M - \Bk 
M 



0(//2) 



for some 7 < 1. Then at each iteration of algorithm (1.2) with ak = ^/L, for any 
e > we have 

f{xk) - fix,) = [/(xo) - /(x,)]0([l - ^i/L + ef) + 0{a'^) 
where a = max{7, 1 — ^/ L\ + e. 



Proof Let pk = "' j^ Using (3.2) and Lemma 2.1, we obtain the bound 

/(xfe+i) - /(x,) < (1 - p/L)[f{xk) ~ f(x,)] + ^(/3i + 2P2L[f{xk) - fix,)]) 
= (1 - p/L + AP2Pk)[fixk) - fix.)] + '^Pk 
= ujk[fixk) - fix,)] + '^Pk, 
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where u)k 1 ~ ^i/ L + ^P2Pk- Apply this recursively and use = 0{j'') to obtain 

fc-l k-l fe-l 

fix,) - fix.) <[f{xo) ^ f{x,)]'[[oj. + J20{f n ^J-)- 
Take ■— niax{7, uj,}- Because 7' = ]Xi=i 1 — YVj^i 

fc-l fe-l k-l 

fixk) - fix.) < [/(xo) - fix.)] n + E ^( n 

i=0 1=0 j = l 

fe-l fe-l 

= [/(:ro)-/(a;*)]n^' + ^(^n^: 
i=0 j = l 

Because pk 0, it follows that ojk 1 — ji/L and thus ni=o ^« ~ ^'([1 — /i/L + e]''') for 
any e > 0, which bounds the first term in the right-hand side above. Furthermore, we 
now use the fact that Sk \i S '■= max{7, 1 — fi/L} to show that the second term is in 
©([(J + e]*^) for any e > 0. In particular, choose N large enough that [Sk + Sk/k) <6 + e 
for all fc > iV and choose the constant ^ so that 

fe-l 

ki[s,<as+e)', 

for all k < N. Then by induction, fc 11^=1 = Oicr'^) for all k because 

fe fe-l 
ik + l)l[S, - i6k + 5k/k)kl[S, 

< (4 + 4/fc)e(5 + e)' 
<(^ + e)e(^ + e)'= 

= as+e)''+\ 

□ 

An interesting difference with Theorem 2.2, which considers a generic error in the 
gradient, is that the error in the objective function decreases at twice the rate that 
the sample size increases. 

It is possible to get a strong linear rate of convergence by increasing the sample 
size in a more controlled way. Theorem 2.4 guides the choice of the sample size, and 
gives the following corollary. The proof follows by simply ensuring that the right-hand 
side of (3.2) is bounded as required by Theorem 2.4. 

Corollary 3.2 (Strong linear rate with deterministic sampling). Suppose that 
(1.8) holds, and that the sample size \Bk\ is increased so that at each iteration 
fc = 0,l,..., 

^^^'' )' (/3i + 2/?2i[/(xfe) - fix.)]) < i(/i/i - p)[fixk) - fix.)] 

for some positive p < fj,/ L. Then at each iteration of algorithm (1.2) with ak = 1/i, 
fixk+i) - fix.) < (1 - p)[fixk) - fix.)]. 
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We see that if the individual functions fi are very similar (so that f3i and /32 are 
small) then we can choose a fairly small sample size. In contrast, if the fi are very 
dissimilar then we must use a larger sample size. 

3.2. Stochastic sampling. Theorem 3.1 and Corollary 3.2 are based on the 
deterministic bound (3.2) on the gradient error, and hold irrespective of the manner 
in which the elements of the samples Bk are chosen, e.g., the samples do not need to 
be chosen cyclically or sampled uniformly. We can obtain a strictly tighter bound in 
expectation, however, if we choose the sample by uniform sampling without replacement. 
In particular, by suitably modifying the derivation of a well-known result in statistical 
sampling, we can obtain a bound in terms of a quantity related to the sample variance 
of the gradients: 

M 



^fc-^^El|V/,(xfc)-V/(xfc) 

4=1 



Using an argument similar to [16, §2.7], uniform sampling without replacement yields 

■M-\Bk\\ Sk 



n\Wk\ 



By using (1.8), we obtain the bound 



M 



\Bk 



M-\Bk\ 1 
M ■ \Bk\ 



M 
M - 1 



An expected convergence result parallel to Theorem 3.1 can then be obtained with 
only minor changes to the proof. 

Theorem 3.3 (Weak linear rate with stochastic sampling). Suppose that (1.8 
holds, and that 



M-\Bk\ ^ 
■ \Bk 



M 



Oh" 



for some 7 < 1. Then at each iteration of algorithm (1.2) with Uk = 1/i, for any 
e > we have 

nfi^k) ~ /(.T*)] = [f{xo) - f{x.)]0{[l - li/L + e]") + 0{a") 
where a = max{7, 1 — ^/ L\ + e. 



Note that the right-hand side of the bound on ]E[||efc|p] is uniformly better than 
the bound shown in (3.2). Importantly, it initially decreases to zero at a faster rate 
as the size of the sample increases. Figure 3.1(a) illustrates the difference in the 
sample-size requirements between Theorems 3.1 and 3.3, i.e.. 



M-\Bk 



M 



(deterministic) vs. 



M-\Bk 
M 



1 



(stochastic). 



as the sample size \Bk\ — > M :— 10"*. Figure 3.1(b) illustrates the sample-size schedule 
needed to realize a linear convergence rate in the deterministic and stochastic cases. 
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Fig. 3.1. (a) Bounding factors in the error of the sampled gradient. The error bound for 
stochastic sampling is uniformly better than for deterministic sampling, (b) Minimum sample-size 
schedule required to achieve a linear rate with error constant 0.9. In both figures, M = 10^. 



4. Practical implementation. The analysis so far has focused on the approxi- 
mate gradient descent iterations given by (1.2)-(1.3). In practice, it is useful to make 
two modifications to this basic algorithm. The first, described by (1.5)-(1.6), is based 
on scaling the search directions to account for curvature information in /. The second 
allows for a varying step size. For this section, we define the sampled objective function 
and its gradient by 

h{x) = 5^ h{x) and gk{x) = t^^Y. ^M^)- (4-1) 

4.1. Scaled Direction. Our implementation attempts to gather useful curvature 
information on the function / by maintaining a quasi-Newton approximation to the 
Hessian Hk- The search directions dk are then taken as the solution of the system (1.6). 
The approximate Hessian is maintained via the recursive application of the update 

Hk+i = U{Hk,yk,Sk), 
where U represents an update formula on the fcth iteration, while 

Sfe := Xk+i ~ Xk and yk := gk{xk+i) - gk{xk) 

measure the change in x and the sampled gradient. In our experiments (see §5) we 
use a limited-memory quasi-Newton update, which maintains a history of the previous 
£ = 10 pairs {sk,yk), and recursively applies the formula iJi+i = U{Hi, Si,yi) for 
i = k — £, . . . ,k. Nocedal and Wright [28, §7.2] describe the recursive procedure for the 
limited-memory BFGS update that we use. Updates are skipped if necessary in order 
to ensure that the approximation Hk remains positive definite and has a bounded 
condition number. To ensure that dk is well-scaled, we use the Shanno and Phua [33] 
scaling of the initial Hessian approximation on each iteration as described by Nocedal 
and Wright [28, p. 178]. 

Another approach, not considered here, is to base the quasi-Newton Hessian on 
an independent set of sampled gradients [7]. 
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4.2. Varying stepsize. A weakness of our convergence analysis is the require- 
ment for a fixed steplength = L, in part because the Lipschitz constant is not 
usually known, and in part because a dynamic steplength is typically more effective in 
practice. But a linesearch procedure that ensures a sufficient decrease condition in 
the true function / runs contrary to a sampling scheme specifically designed to avoid 
expensive evaluations with /. 

In our implementation we attempt to strike a balance between a rigorous linesearch 
and none at all by enforcing an Armijo-type descent condition on the sampled objective 
/fc. In particular, we use a linesearch procedure to select a steplength ak that satisfies 

fkixk + adk) < fk{xk) + T]agk(xkfdk, (4.2) 

where dk is the current search direction and 77 e (0, 1). While in deterministic quasi- 
Newton methods we typically first test whether a ~ 1 satisfies this condition, in our 
implementation we set our initial trial step length to a = |Sfe_i|/|Sfc|. 

In general, Bk represents only a fraction of all observations, and so the above 
procedure may not even yield a decrease in the true objective at each iteration. But 
because we steadily increase the sample size, the procedure just described eventually 
reduces to a conventional linesearch based on the true objective function /. Hence, the 
method may initially be nonmonotic, but is guaranteed to eventually be monotonic. 
Coupled with the choice of search direction described in §4.1, the overall algorithm re- 
duces to a conventional linesearch method with a quasi-Newton Hessian approximation, 
and inherits the global and local convergence guarantees of that method. 

5. Numerical experiments. This section summarizes a series of numerical 
experiments in which we apply our incremental-gradient method with a growing 
sample size to a series of data-fitting applications. Table 5 summarizes the test 
problems. 

The first four experiments are data-fitting applications of logistic regression of 
varying complexity: binary, multinomial, chain-structured conditional random fields 
(CRFs), and general CRFs. These logistic- regression applications follow a standard 
pattern. We first model the probability of an outcome hi by some log-concave function 
p{bi I ai,x)^ where is data; the goal is to choose the parameters x so that the 
likelihood function 

j\/ 

C[x) = I ai,x) 

4=1 

is maximized. We then approximate x by minimizing the 2-norm-regularized negative 
log-likelihood function 

/(^) + 5-^11-^11' ^itli h{x) = -\ogp{h\a,,x) (5.1) 

i=l 

for some positive regularization parameter A. These objectives are all strongly convex 
(with > A) and satisfy our assumptions in §1.2. For some functions p the Lipschitz 
constant of V/, or upper bounds on it, are available. We discuss the particular case 
of the binary model in §5.1. 

The last experiment is a more general application of nonlinear least-squares 
to seismic inversion. This last data-fitting application does not satisfy our central 
convexity assumption, but nonetheless illustrates the practical relevance of our approach 
on difficult problems. 

Our numerical experiments compare the following three methods: 
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Table 5.1 
The test problems. 



Problem 



M 



n 



Description 



binary logistic regression 92,189 

multinomial logistic regression 70,000 

chain-structured CRF 8,936 

general CRF 50 

nonlinear least squares 101 



823,470 
785 
1,643,004 



10,201 



4 



spam identification (§5.1) 
digit identification (§5.2) 
noun-phrase chunking (§5.3) 
image denoising (§5.4) 
seismic inversion (§5.5) 



Deterministic. A conventional quasi-Newton linesearch method that uses the true 
function / and gradient V/ at every iteration. The method is based on a limited- 
memory BFGS Hessian approximation and a linesearch based on Hermite cubic- 
polynomial interpolation and the strong Wolfe conditions. Several comparison studies 
indicate that these type of limited-memory quasi-Newton methods are among the most 
efficient deterministic methods available for solving large-scale logistic regression and 
conditional random field problems [19, 36, 21, 32]. 

Stochastic. An incremental-gradient method based on the iteration (1.2) with 
Qk = V/i(a;/c) and a constant a^, where the index i € {1, . . . , M} is randomly selected. 
This corresponds to a constant sample size of one, and this simple method has 
proved competitive with more advanced deterministic methods like the one above for 
estimation in CRF models [35]. 

Hybrid. This is the proposed method described in §4, which uses search directions 
computed from (1.6) and a linesearch based on satisfying condition (4.2). As with 
the deterministic method described above, the Hessian approximations are based on 
limited-memory BFGS and the linesearch uses polynomial interpolation. The objective 
and gradient approximations are based on (4.1), where Bk ^ {1, • ■ • ,M}, and the 
number of elements in the current sample is initially 1, and grows linearly as per the 
formula 



The hybrid nature of this approach should now be clear; the very first iteration is 
similar to the stochastic method; when the sample size grows to include all observations, 
the algorithm morphs into the deterministic method. 

All experiments are carried out using Matlab R2010b on a 64-bit Athlon machine. 
Two plots are shown for each experiment. The first shows the progress of the objective 
value against the index P = J2k=o I'^^l for p = 1,2,..., which measures the effective 
number of passes through the entire data set. The second plot shows the cumulative 
number of fi functions evaluated, i.e., X)i=o I'^'l' a-ga-inst the iterations k. 

5.1. Binary logistic regression. Logistic regression models [5, §4.3.2] are used 
in an enormous number of applications for the problem of binary classification. We 
are given data with M examples of input-output pairs {ai,bi), where e M" is a 
vector of n features, and hi G {—1, 1} is a corresponding binary outcome. The goal is 
to build a linear classifier that, given the features ai and a vector of parameters x, the 
sign of the inner- product afx gives bi. The logistic model gives the probability that bi 
takes the value 1: 



\Bk+i\ = rmin{l.l- |6fc| + l, Af}l. 



pi{bi = 1 I ai,x) 



exp{afx) 1 



exp(af x) + 1 1 + exp(-a?x) ' 




(Typically a bias variable is added, but we equivaleiitly assume that the first element of 
X is set to one.) Thus, the probability that hi takes the value -1 is [1 — p{bi = I \ ai, x)\. 
We can write these two cases compactly as 

Pi{bi I a,,x) = — j 

1 + exp(-6ia/ X) 

This is the probability function p used in (5.1). 

The dominant cost in computing / and its gradient is the cost of forming the 
matrix- vector products Ax and A^y (for some y), where the M rows of the matrix 
A are formed from the vectors a^. The Hessian is V^/(x) = A^DA, where D is a 
diagonal with elements pi{bi \ ai,x) ■ [1 — pi{bi \ ai,x)], which lie in the range (0,0.25]. 
The (nonnegative) eigenvalues of the Hessian are thus bounded above by 0.25||A|p, 
and are strictly positive if A has full rank. Combined with 2-norm regularization, the 
resulting function / satisfies the assumptions of §1.2. 

Our experiments for binary logistic regression are based on the TREC 2005 data 
set, which contains 823,470 binary variables describing the presence of word tokens in 
92,189 email messages involved in the legal investigations of the Enron corporation [9]. 
The target variable indicates whether the email was spam or not. The data set was 
prepared by Carbonetto [8, §2.6.5], and we set the regularization parameter A — 0.01. 

The results of this experiment are plotted in Figure 5.1, where we define the 
optimal value as the best value found across the methods after 150 effective passes 
through the data, and where for the stochastic method we plot the three step sizes 
(among powers of 10) that gave the best performance over the allotted iterations. Of 
course, in practice we will not know what step size optimizes the performance of the 
stochastic method, and the lack of sensitivity of the result to the initial step size is 
an advantage of the deterministic and hybrid methods. In the plot we see that, like 
the stochastic methods, the hybrid method makes rapid initial progress. Unlike the 
stochastic method, however, the hybrid method continues to make steady progress 
similar to the deterministic method. This behavior agrees with the theory. 

5.2. Multinomial logistic regression. Multinomial logistic regression relaxes 
the binary requirement, and allows each outcome bi to take any value from a set of 
classes C [5, §4.3.4]. In this model there is a separate parameter vector xj for each 
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Fig. 5.2. Multinomial logistic regression experiments for different optimization strategies for 
digit classification. 



class j ^C. We model the probability that bi is assigned a particular class j as 

(u -I r 1 ^ exp(a;Jai) 

P2{h = J I {xj\jec) = ^ -TT^- (5-2) 

Ei'GC exp(a;j,oJ 

This model is equivalent to binary logistic regression in the special case where the 
parameters Xj of one class are fixed at zero and there are only two classes, i.e., 
C = {— 1,1}. As with binary logistic regression, the function p2 is log-concave and 
— \0gp2 has a Hessian whose eigenvalues are bounded above. Hence, the resulting 
function / in (5.1) satisfies the assumptions of §1.2. 

Our experiments for multinomial logistic regression are based on the well-known 
MNIST data set [15], containing 70,000 examples of 28-by-28 images of digits, where 
each digit is classified as one of the numbers through 9. The results are plotted 
in Figure 5.2, with A = 1. The trends are similar to the binary logistic regression 
experiments. 

5.3. Chain-structured conditional random fields. The binary and multi- 
nomial logistic regression models consider the case where a particular outcome hi is 
associated with a particular feature vector a^. The CRF model takes multinomial 
logistic regression a step further by considering a set of feature vectors with which 
to predict corresponding values for discrete variables € C, where k takes values in a 
discrete ordered set Vl. 

We can naturally extend the multinomial logistic model to this scenario by defining 
the probability of a joint assignment bi as 

= ]k}ken I {a^i}ken. {^^jbec) = ^ IT exp(a;J,a,^). 

* ken 

(We assume that the parameter vectors Xj^, are tied, so that Xj^ is constant for all k] 
this assumption is not required in general.) The normalizing constant 

iieCi2eC jkecken 

is chosen so that the distribution sums to one over all possible configurations of the 5f 
variables. As written, computing Zi involves a very large number of terms, |C|I^I; still, 
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Fig. 5.3. Chain-structured conditional random field experiments for different optimization 
strategies on the noun-phrase chunking task. 



the sum can be computed efficiently by exchanging the order of operations. While 
this model is a straight-forward generalization of multinomial logistic regression, it 
assumes that the labels in that we are simultaneously predicting are independent. 
This might be unrealistic if, for example, the variables come from time-series data 
where b'l and b'^'^^ are likely to be correlated. 

A chain-structured CRF [14] augments the model with additional terms that take 
into account sequential dependencies in the labels. It allows pairwise features a^'^ and 
associated parameters Xkk' , and defines the probability of a configuration as 



P3{{bt =j}ken I {a*}fegn,{af'' }k.k'en,k'=k+i,{xj}jec,{xj.j'}j.j'ec) 



1 



n exp(a;J,a - ) 



feen 



n exp(xj;,^,af' ) 



The normalizing constant Z,; is again set so that the distribution sums to one, and it 
can be computed using a variant on the forward-backward algorithm used in hidden 
Markov models [30, § III]. Because we need to run the forward-backward algorithm 
for each i, the probability function is significantly more expensive to evaluate than 
the corresponding multinomial probability function see (5.2). 

For our chain-structured CRF experiments, we use the noun-phrase chunking 
problem from the CoNLL-2000 Shared Task [31], where the goal is to assign each word 
in a sentence to one of 22 possible states, and we use approximately 1.6 million features 
to represent the presence of words and word combinations [31, 32]. The CoNLL-2000 
data set contains 211,727 words grouped into 8,936 sentences. The results of this 
experiment are plotted in Figure 5.3, where the regularization parameter is A = 1. 



5.4. General conditional random fields. While chain-structures can model 
sequential dependencies in the labels, we might be interested in other structures, such 
as lattice structures for modeling image data. In order to use general CRFs [12, §4.6,1 
and §20.3.2], we define a graph Q where the nodes are the labels {1, 2, . . . , w} and 
the edges are the variable dependencies that we wish to consider. We then define the 
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Fig. 5.4. Lattice- structured conditional random field experiments for different optimization 
strategies for an image denoising task. 



probability of bi taking a configuration = j^, for fc € fi, as 

Piii^i ^jkjken I {di }k'en,Wi'' }{k,k')e£,{xj}jec,{^]j'}u,j')ec) 



1 



n exp(a;J^a-') 



ken 



n exp(a::^ 

{k,k')e£ 



In this general case, computing Zi is in the complexity class jJP, and the best 
known algorithms have a runtime that is exponential in the tree-width of the graph 
G [12, §9-10]. For a two-dimensional lattice structure, the tree- width is the minimum 
between the two dimensions of the structure, so computing Zi is only feasible if one of 
the dimensions is very small (in the degenerate one-dimensional chain-structured case, 
the tree- width is one). 

Because of the intractability of computing Zi , we consider optimizing a pseudo- 
likelihood approximation [4] based on the probability model 

Pi{{bi ^jk}ken I {ttijkenAo'i'' }{k,k')e£,{x3}jec,{xjf}(j^j')ec) 

~ Y[pibi ^ ik I {a't'}k'en,{a'l^'}(k,k')e£A^]}jec,{xjy}[j,f)ecAb^}k'en,k'^k)- 
ken 

The individual terms in this product of conditionals have the form of a multinomial 
logistic regression probability and are straightforward to compute. This is the function 
p used to define the objective function / in (5.1). 

Our experiments on general CRFs are based on the image-denoising experiments 
described by Kumar and Hebert [13]. We use their set of 50 synthetic 64-by-64 images. 
Figure 5.4 shows the performance of the different methods with a regularization 
parameter of A = 1. Figure 5.5 illustrates the marginal probabilities for the different 
methods at various points in the optimization for a randomly-chosen image in the 
data set. (For the stochastic method, we plot the result with a step size of a = 10"'*.) 
To approximate these marginals, we use the loopy belief propagation message-passing 
algorithm [5, §8.4.7]. In these plots we see that the deterministic method does poorly 
even after two full passes through the data set, while the stochastic and hybrid methods 
do much better. After five passes through the data set, the hybrid method has found 
a solution that is visually nearly indistinguishable from the true solution, while it is 
still possible to see obvious differences in the deterministic and stochastic methods. 
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(a) original image 



(b) noisy image 




(d) 



(e) 



(g) 



(h) 



Fig. 5.5. Top row: original (a) and noisy (b) image. Second row: marginals after 2 passes 
through the data for deterministic (c), stochastic (d), and hybrid (e). Third row: marginals after 5 
passes through the data for deterministic (f), stochastic (g), and hybrid (h). 

5.5. Seismic inversion. This last numerical experiment is a seismic inversion 
problem described by van Leeuwen et al. [34]. The aim here is to recover an image 
of underground geological structures using only data collected by geophone receivers 
placed at the surface of the earth; these geophones record acoustic "shots" created by 
sources also at the surface. 

The waveform inversion problem attempts to find a model x of the subsurface 
structure that minimizes the nonlinear least-squares misfit as measured by the function 

M 



i—l a;6zO 



Each index i corresponds to a particular shot (i.e., an observation) created by the 
source qi which creates a measurement di] each experiment samples a set of frequencies 
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Fig. 5.6. Nonlinear least-squares experiments for different optimization strategies on a seismic 
inversion problem. 



Lj G SI. The matrix P samples the wavefield at the receiver locations. The main 
cost in evaluating (f) is solving the Helmholtz equation Hi^[x]u — qi for each i, which 
is an expensive partial differential equation. Regularization is typically achieved by 
truncating the solution process [34] . Although the function <j) is nonconvex and does not 
satisfy our standing assumptions, it hints at the applicability of the hybrid approach 
for solving difficult problems of important practical interest. 

The results shown in Figure 5.6 are based on a relatively small 2-dimensional 
example that involves M=101 sources measuring 8 frequencies. (Larger experiments, 
especially in 3-dimensions, are only feasibly accomplished on a computing cluster.) 
Because the problem is not strongly convex and hence we do not expect a fast 
convergence rate when far from this solution, for this problem the hybrid method 
increases the sample size by only one sample at each iteration, i.e., = min{|Sfc| + 

1, M}. After 20 passes through the data, the hybrid method clearly outperforms the 
deterministic method. The best stochastic method performs nearly as well as the 
deterministic and hybrid methods for the first 60 iterations, though with other step 
sizes it performs poorly. Although the figure shows the best methods all achieving 
similar residuals after 60 passes through the data, in practice that involves a prohibitive 
number of Helmholtz solves; practitioners are interested in making quick progress with 
as few solves as possible (and without needing to test a variety of step sizes to achieve 
good performance). 

6. Discussion. Our work has focused on inexact gradient methods for the un- 
constrained optimization of differentiable strongly-convex objectives. We anticipate 
that a similar convergence analysis with inexact gradients could be applied to other 
algorithms, such as Nesterov's accelerated gradient method [26, §2.1]. In this case, it 
may be possible to relax the strong convexity assumption, and obtain the optimal 
©(l/fc^) rate for an inexact-gradient version of this algorithm. The optimal ©(l/fc^) 
rate using a gradient approximation whose error is uniformly bounded across iterations 
has been established by several authors, notably d'Aspremont [10]. But allowing a 
variable error would encourage more flexibility in the early iterations, and would allow 
for eventually solving the problem to arbitrary accuracy. Although the ©(l/fc^) rate is 
sub- linear, it is substantially faster than the optimal 0{l/\/k) achievable by methods 
that use noisy gradient information [24, §14.1]. 

We might also obtain analogous rates for proximal-gradient methods for optimiza- 
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tion with convex constraints or non-difFerentiable composite optimization problems, 
such as 1-norm regularization [27]. The more general class of mirror-descent meth- 
ods [1] , which are useful for problems with a certain geometry such as optimization with 
simplex constraints, also seem amenable to analysis in our controlled error scenario. 

We considered the case of bounded noise or noise that can be bounded in expec- 
tation, and subsequently derived convergence rates and expected convergence rates, 
respectively. We might also consider the case where the noise is bounded with a certain 
probability. If the individual V fi{x) are concentrated around V/(a;), this might allow 
us to use concentration inequalities [20] to show that the convergence rates hold with 
high probability. Although we have analyzed an arbitrary strategy for selecting the 
elements of the sample, and shown that uniform sampling achieves a better bound in 
expectation, it is possible that a quasi-random selection of the individual gradients 
might further refine the bound [22]. 

Although our emphasis here is on data fitting applications where the error is a 
by-product of subsampling the data, our analysis and implementation may be useful 
for other problems. For example, Gill et al. [11, p. 357] discuss the case of an objective 
function that can be evaluated to a prescribed accuracy (e.g., it could depend on 
an iterative process or a discretization level). They suggest solving the optimization 
problem over a sequence of tighter function accuracies. Our work provides a formal 
analysis and practical implementation of a method where the accuracy might be 
increased dynamically each iteration rather than solving a sequence of intermediate 
optimization problems. As a more recent example, Poyiadjis et al. [29] consider 
approximating gradients in non-Gaussian state-space models using particle filters. 
Here, the variance of the approximation is directly proportional to the number of 
particles, and thus our work provides guidelines for selecting the number of particles 
to use in the approximation on each iteration. 
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